── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose() masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract() masks R.utils::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::partial() masks pryr::partial()
✖ dplyr::where() masks pryr::where(), gtsummary::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
Display your machine memory.
memuse::Sys.meminfo()
Totalram: 8.000 GiB
Freeram: 54.016 MiB
In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.
Q1. Visualizing patient trajectory
Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.
Q1.1 ADT history
A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.
Do a similar visualization for the patient with subject_id 10063848 using ggplot.
Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.
Solution 1.1
system("gunzip -k ~/mimic/hosp/labevents.csv.gz")
# Write the CSV file to Parquet formatarrow::write_dataset(open_dataset("~/mimic/hosp/labevents.csv.gz", format ="csv"),path ="./labevents_pq.parque",format ="parquet")
arrow::write_dataset(open_dataset("~/mimic/icu/chartevents.csv.gz", format ="csv"),path ="./chartevents_pq.parque",format ="parquet")
# File paths to the data filespatients_file <-"~/mimic/hosp/patients.csv.gz"admissions_file <-"~/mimic/hosp/admissions.csv.gz"transfers_file <-"~/mimic/hosp/transfers.csv.gz"procedures_icd_file <-"~/mimic/hosp/procedures_icd.csv.gz"diagnoses_icd_file <-"~/mimic/hosp/diagnoses_icd.csv.gz"d_icd_procedures_file <-"~/mimic/hosp/d_icd_procedures.csv.gz"d_icd_diagnoses_file <-"~/mimic/hosp/d_icd_diagnoses.csv.gz"labevents_pq_dir <-"./labevents_pq.parque"
Warning in left_join(., d_icd_diagnoses, by = "icd_code"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 17 of `x` matches multiple rows in `y`.
ℹ Row 15793 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
diagnoses_text <-paste(head(diagnoses, 3), collapse ="\n")# Convert intime and outtime to POSIXct formattransfers_data <- transfers_data %>%mutate(intime =as.POSIXct(intime, format ="%Y-%m-%d %H:%M:%S", tz ="UTC"),outtime =as.POSIXct(outtime, format ="%Y-%m-%d %H:%M:%S", tz ="UTC"))# Prepare data for ADT events (Admission, Discharge, Transfer)adt_data <- transfers_data %>%filter(careunit !="UNKNOWN") %>%# Drop rows where careunit is "unknown"mutate(care_unit_duration =difftime(outtime, intime, units ="hours"),line_width =ifelse(grepl("icu", careunit, ignore.case =TRUE),3, 2.5))# Prepare lab event datalab_data <- labevents_data %>%group_by(charttime) %>%summarise(n =n()) %>%mutate(event ="Lab")# Prepare procedures dataprocedures_data_1 <- procedures_data %>%left_join(d_icd_procedures, by ="icd_code") %>%mutate(event =case_when(icd_code =="Central venous catheter"~"Central venous catheter", icd_code =="Other incision of brain"~"Other incision of brain",TRUE~"Spinal tap"))procedures_data_1 <- procedures_data_1 %>%mutate(chartdate =as.POSIXct(chartdate, format ="%Y-%m-%d"))
ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.
Do a similar visualization for the patient 10063848.
Solution 1.2
arrow::write_dataset(open_dataset("~/mimic/icu/procedureevents.csv.gz", format ="csv"),path ="./procedureevents_pq.parque",format ="parquet")
# Load necessary librarieslibrary(tibble)library(readr)# Import icustays.csv.gz as a tibbleicustays_tble <-read_csv("~/mimic/icu/icustays.csv.gz") %>%as_tibble()
Rows: 94458 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): first_careunit, last_careunit
dbl (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the first few rows of the tibbleprint(head(icustays_tble),10)
How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.
Solution 2.2
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:arrow':
schema
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Count the number of ICU stays per subjecticu_stays_summary <- icustays_tble %>%group_by(subject_id) %>%summarise(icu_stay_count =n()) %>%arrange(desc(icu_stay_count))print(icu_stays_summary)
p <-ggplot(icu_stays_summary, aes(x = icu_stay_count)) +geom_bar(fill ="skyblue", color ="black") +labs(title ="Number of ICU Stays per Subject",x ="ICU Stay Count",y ="Number of Subjects" ) +theme_minimal()# Convert to an interactive plot using plotlyinteractive_plot <-ggplotly(p)# Display the plotinteractive_plot
Most subjects have a single ICU stay: The vast majority of subjects (around 50,000) had only one ICU stay. This is the tallest bar on the far left, indicating that most patients only visit the ICU once.
Gradual decrease for multiple stays: There is a sharp decline in the number of subjects as the ICU stay count increases. Fewer subjects had multiple ICU stays.
Very few subjects have more than 10 ICU stays: Beyond 10 ICU stays, the number of subjects drops to almost zero, and the graph flattens out. There are only a few outliers with an ICU stay count greater than 20, indicating that it’s extremely rare for patients to be admitted to the ICU more than 10-15 times.
Import admissions.csv.gz as a tibble admissions_tble.
Solution 3.1
# Import icustays.csv.gz as a tibbleadmissions_tble <-read_csv("~/mimic/hosp/admissions.csv.gz") %>%as_tibble()
Rows: 546028 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the first few rows of the tibbleprint(admissions_tble)
All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.
Admissions per patient
Solution 3.2
Number of admissions per patient
library(plotly)# Number of admissions per patientadmissions_per_patient <- admissions_tble %>%group_by(subject_id) %>%summarise(admission_count =n())# Static ggplot2 plotp1 <-ggplot(admissions_per_patient, aes(x = admission_count)) +geom_histogram(binwidth =2, fill ="skyblue", color ="black") +labs(title ="Number of Admissions per Patient",x ="Number of Admissions",y ="Number of Patients" ) +theme_minimal()# Convert to interactive plotggplotly(p1)
The “Number of Admissions per Patient” histogram demonstrates a highly skewed distribution, with a significant majority of patients experiencing only a single admission. Visually, the first bar on the histogram is dramatically taller than all subsequent bars, indicating a disproportionately high count of single admissions. This indicates that most hospital visits are likely for acute conditions or short-term procedures. The rapid decline in patient counts with increasing admissions points to a smaller cohort requiring multiple hospitalizations.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
# Convert to interactive plotggplotly(p2)
The “Admission Hour Distribution” graph reveals a notable peak in admissions at midnight, suggesting administrative or system-related influences. The midnight peak is the highest point on the graph, significantly exceeding the values observed in other hours.The spike at hour 0 is unusually high and might not represent actual patient admissions but more of a data recording anomaly. It’s common for large spikes at midnight due to default values in databases or rounding practices in electronic health records (EHRs). A decrease in admissions is observed during early morning hours, followed by increases during mid-morning and afternoon, reflecting scheduled procedures and emergency department activity. A plateau in the evening indicates sustained hospital activity, with a subsequent decline late at night.
The “Admission Minute Distribution” graph displays peculiar peaks at specific minutes (approximately 0, 15, 30, and 45). These peaks are significantly higher than the baseline values, indicating a concentration of recorded admissions at these specific minutes. These peaks are likely artifacts of data recording practices, such as rounding or default values in the electronic health record system, rather than actual variations in admission frequency.
Length of hospital stay
# Required libraries# Calculate length of stay in daysadmissions_tble <- admissions_tble %>%mutate(length_of_stay =as.numeric(difftime(dischtime, admittime, units ="days")) )# Plot a histogram for length of stay with x-axis limited to 0-100 daysplot <-ggplot(admissions_tble, aes(x = length_of_stay)) +geom_histogram(binwidth =2, fill ="skyblue", color ="black") +xlim(0, 100) +# Limit x-axis to 0-100 dayslabs(title ="Distribution of Length of Hospital Stay ",x ="Length of Stay (Days)",y ="Count of Admissions" ) +theme_minimal()# Convert to interactive plot with plotlyinteractive_plot <-ggplotly(plot)
Warning: Removed 417 rows containing non-finite outside the scale range
(`stat_bin()`).
# Show the plotinteractive_plot
The “Distribution of Length of Hospital Stay” histogram shows a strongly skewed distribution, with the highest frequency of admissions occurring at very short lengths of stay. The histogram indicates that the majority of admissions result in brief hospitalizations, while a long tail on the right side of the graph signifies a smaller proportion of patients requiring extended stays exceeding 50 days, likely due to complex medical needs or surgical procedures.
The high count of single admissions in the “Number of Admissions per Patient” graph directly corresponds to the high frequency of short stays in the “Distribution of Length of Hospital Stay” graph. This suggests that a large portion of single admissions are associated with brief hospitalizations.
The significant midnight peak in the “Admission Hour Distribution” graph might contribute to the high number of single admissions and short stays, as these midnight admissions could be related to administrative procedures or short-term observations.
The data recording anomalies in the “Admission Minute Distribution” graph raise concerns about the accuracy of time-related analyses, which could potentially affect the interpretation of length of stay and admission frequency data.
# Import patients.csv.gz as a tibblepatients_tble <-read_csv("~/mimic/hosp/patients.csv.gz") %>%as_tibble()
Rows: 364627 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, anchor_year_group
dbl (3): subject_id, anchor_age, anchor_year
date (1): dod
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the first few rows of the tibbleprint(patients_tble)
# A tibble: 364,627 × 6
subject_id gender anchor_age anchor_year anchor_year_group dod
<dbl> <chr> <dbl> <dbl> <chr> <date>
1 10000032 F 52 2180 2014 - 2016 2180-09-09
2 10000048 F 23 2126 2008 - 2010 NA
3 10000058 F 33 2168 2020 - 2022 NA
4 10000068 F 19 2160 2008 - 2010 NA
5 10000084 M 72 2160 2017 - 2019 2161-02-13
6 10000102 F 27 2136 2008 - 2010 NA
7 10000108 M 25 2163 2014 - 2016 NA
8 10000115 M 24 2154 2017 - 2019 NA
9 10000117 F 48 2174 2008 - 2010 NA
10 10000161 M 60 2163 2020 - 2022 NA
# ℹ 364,617 more rows
Q4.2 Summary and visualization
Summarize variables gender and anchor_age by graphics, and explain any patterns you see.
Solution 4.2
# Load necessary librarieslibrary(ggplot2)# Gender distributionggplot(patients_tble, aes(x = gender)) +geom_bar(fill ="lightblue", color ="black") +labs(title ="Distribution of Gender", x ="Gender", y ="Count") +theme_minimal()
The bar chart shows the distribution of gender. Here, the counts of females (F) and males (M) are displayed. While both genders are well-represented, there are slightly more females than males.
library(plotly)# Create the histogram plotp <-ggplot(patients_tble, aes(x = anchor_age)) +geom_histogram(binwidth =2, fill ="lightgreen", color ="black") +labs(title ="Distribution of Anchor Age", x ="Anchor Age", y ="Count") +theme_minimal()# Convert the ggplot to an interactive plotly plotggplotly(p)
The younger peak could correspond to more frequent hospitalization of adults for acute conditions, accidents, or surgeries.
The older peak reflects elderly patients often requiring ICU care for chronic conditions or age-related ailments.
The spike at the right indicates the dataset’s anonymization strategy for older patients, aggregating all patients above 89 into a single category to ensure privacy. The spike at age 89-100 likely results from an anonymization process used in MIMIC. For privacy reasons, patients over a certain age (usually around 89) are assigned an anchor age of 90+. This is done to reduce the risk of identifying older individuals who could be more easily recognized due to their age. Hence, the spike on the right represents a group of patients who are over 89 years old, but whose exact ages have been masked.
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement. Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.
library(dplyr)library(tidyr)library(stringr)# Open the parquet datasetlabevents_tble <-open_dataset(labevents_pq_dir, format ="parquet")
labs_data <- labevents_tble |>to_duckdb() |># Select necessary variablesselect(subject_id, itemid, storetime, valuenum) |># Filter for itemid of interestfilter(itemid %in% dlabitems_tble$itemid) |># Join with icustays table to get intimeleft_join(select(icustays_tble, subject_id, stay_id, intime),by =c("subject_id"),copy =TRUE ) |># Filter to keep only records before ICU intimefilter(storetime < intime) |># Group by subject_id, stay_id, and itemidgroup_by(subject_id, stay_id, itemid) |># Keep only the last storetime for each item before intimeslice_max(storetime, n =1) |># Remove storetime and intime columns, ungroup the dataselect(-storetime, -intime) |>ungroup() |># Pivot wider to make itemid names as columnspivot_wider(names_from = itemid, values_from = valuenum) |># Rename columns based on dlabitems_tble labelsrename_at(vars(as.character(dlabitems_tble$itemid)),~str_to_lower(dlabitems_tble$label) ) |># Rename specific columns (e.g., white blood cells)rename(wbc =`white blood cells`) |># Show the query for debugging purposesshow_query() |># Collect the results into an R dataframecollect() |>relocate(subject_id,stay_id,bicarbonate,chloride,creatinine,glucose, potassium, sodium,hematocrit, wbc)|># Arrange by subject_id and stay_idarrange(subject_id, stay_id) %>%# Print the dataframe with full widthprint(labs_data,width=Inf)
<SQL>
SELECT
subject_id,
stay_id,
MAX(CASE WHEN (itemid = 50971.0) THEN valuenum END) AS potassium,
MAX(CASE WHEN (itemid = 51301.0) THEN valuenum END) AS wbc,
MAX(CASE WHEN (itemid = 50983.0) THEN valuenum END) AS sodium,
MAX(CASE WHEN (itemid = 50882.0) THEN valuenum END) AS bicarbonate,
MAX(CASE WHEN (itemid = 50931.0) THEN valuenum END) AS glucose,
MAX(CASE WHEN (itemid = 51221.0) THEN valuenum END) AS hematocrit,
MAX(CASE WHEN (itemid = 50902.0) THEN valuenum END) AS chloride,
MAX(CASE WHEN (itemid = 50912.0) THEN valuenum END) AS creatinine
FROM (
SELECT subject_id, itemid, valuenum, stay_id
FROM (
SELECT
q01.*,
RANK() OVER (PARTITION BY subject_id, stay_id, itemid ORDER BY storetime DESC) AS col01
FROM (
SELECT LHS.*, stay_id, intime
FROM (
SELECT subject_id, itemid, storetime, valuenum
FROM arrow_001
WHERE (itemid IN (50882, 50902, 50912, 50931, 50971, 50983, 51221, 51301))
) LHS
LEFT JOIN dbplyr_4x8YqJDJNX
ON (LHS.subject_id = dbplyr_4x8YqJDJNX.subject_id)
) q01
WHERE (storetime < intime)
) q01
WHERE (col01 <= 1)
) q01
GROUP BY subject_id, stay_id
# A tibble: 88,086 × 10
subject_id stay_id bicarbonate chloride creatinine glucose potassium sodium
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10000032 39553978 25 95 0.7 102 6.7 126
2 10000690 37081114 26 100 1 85 4.8 137
3 10000980 39765666 21 109 2.3 89 3.9 144
4 10001217 34592300 30 104 0.5 87 4.1 142
5 10001217 37067082 22 108 0.6 112 4.2 142
6 10001725 31205490 NA 98 NA NA 4.1 139
7 10001843 39698942 28 97 1.3 131 3.9 138
8 10001884 37510196 30 88 1.1 141 4.5 130
9 10002013 39060235 24 102 0.9 288 3.5 137
10 10002114 34672098 18 NA 3.1 95 6.5 125
hematocrit wbc
<dbl> <dbl>
1 41.1 6.9
2 36.1 7.1
3 27.3 5.3
4 37.4 5.4
5 38.1 15.7
6 NA NA
7 31.4 10.4
8 39.7 12.2
9 34.9 7.2
10 34.3 16.8
# ℹ 88,076 more rows
Q6. Vitals from charted events
chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.
Solution 6
# Load necessary tableschartevents_tble <-open_dataset("./chartevents_pq.parque", format ="parquet")
# List of vital item IDsvitals_itemids <-c(220045, # Heart rate220179, # Systolic non-invasive blood pressure220180, # Diastolic non-invasive blood pressure223761, # Body temperature (Fahrenheit)220210# Respiratory rate)# Filter `chartevents` for the relevant itemids and join with `icustays_tble`vitals_data <- chartevents_tble |>to_duckdb() |># Select required variablesselect(subject_id, stay_id, itemid, charttime, value,valuenum,storetime) |># Filter for vitals itemids of interestfilter(itemid %in% vitals_itemids) |># Join with icustays table to get intimeleft_join(select(icustays_tble, subject_id, stay_id, intime,outtime),by =c("subject_id", "stay_id"),copy =TRUE ) |>filter(storetime >= intime & storetime <= outtime) |>filter(!is.na(valuenum)) |># Group by subject, stay, and itemidgroup_by(subject_id, stay_id, itemid) |>slice_min(storetime, with_ties =TRUE) |># Calculate the average value for each itemid during the ICU staysummarise(mean =mean(valuenum, na.rm =TRUE)) |># Ungroupungroup() |># Pivot to wider format to make itemid names as columnspivot_wider(names_from = itemid, values_from = mean) |># Rename columns to meaningful namesrename(heart_rate =`220045`,non_invasive_bloodpressue_systolic =`220179`,non_invasive_bloodpressure_diastolic =`220180`,temperature_farhenheit =`223761`,respiratory_rate =`220210` ) |>mutate(heart_rate =round(heart_rate, 1),non_invasive_bloodpressue_systolic =round(non_invasive_bloodpressue_systolic, 1),non_invasive_bloodpressure_diastolic =round(non_invasive_bloodpressure_diastolic, 1),temperature_farhenheit =round(temperature_farhenheit, 1),respiratory_rate =round(respiratory_rate, 1) ) |># Collect the resultscollect() |>relocate(subject_id,stay_id,heart_rate,non_invasive_bloodpressure_diastolic, non_invasive_bloodpressue_systolic, respiratory_rate,temperature_farhenheit)|># Arrange by subject_id and stay_idarrange(subject_id, stay_id) |># Print with full widthprint(width =Inf)
`summarise()` has grouped output by "subject_id" and "stay_id". You can
override using the `.groups` argument.
`summarise()` has grouped output by "subject_id" and "stay_id". You can
override using the `.groups` argument.
Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables
all variables in icustays_tble
all variables in admissions_tble
all variables in patients_tble
the last lab measurements before the ICU stay in labevents_tble
the first vital measurements during the ICU stay in chartevents_tble
The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.
Solution 7
mimic_icu_cohort <- icustays_tble %>%left_join(patients,by ="subject_id") %>%left_join(admissions , by =c("hadm_id","subject_id")) %>%left_join(vitals_data,by =c("subject_id","stay_id")) %>%left_join(labs_data,by=c("subject_id","stay_id"))%>%mutate(intime_year =year(intime), # Extract year from ICU admission timeage_intime = anchor_age + (intime_year - anchor_year) ) %>%# Optionally remove the intermediate intime_year columnselect(-intime_year)print(mimic_icu_cohort,width=Inf)
# A tibble: 94,458 × 41
subject_id hadm_id stay_id first_careunit
<dbl> <dbl> <dbl> <chr>
1 10000032 29079034 39553978 Medical Intensive Care Unit (MICU)
2 10000690 25860671 37081114 Medical Intensive Care Unit (MICU)
3 10000980 26913865 39765666 Medical Intensive Care Unit (MICU)
4 10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)
5 10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)
6 10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
7 10001843 26133978 39698942 Medical/Surgical Intensive Care Unit (MICU/SICU)
8 10001884 26184834 37510196 Medical Intensive Care Unit (MICU)
9 10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)
10 10002114 27793700 34672098 Coronary Care Unit (CCU)
last_careunit intime
<chr> <dttm>
1 Medical Intensive Care Unit (MICU) 2180-07-23 14:00:00
2 Medical Intensive Care Unit (MICU) 2150-11-02 19:37:00
3 Medical Intensive Care Unit (MICU) 2189-06-27 08:42:00
4 Surgical Intensive Care Unit (SICU) 2157-11-20 19:18:02
5 Surgical Intensive Care Unit (SICU) 2157-12-19 15:42:24
6 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
7 Medical/Surgical Intensive Care Unit (MICU/SICU) 2134-12-05 18:50:03
8 Medical Intensive Care Unit (MICU) 2131-01-11 04:20:05
9 Cardiac Vascular Intensive Care Unit (CVICU) 2160-05-18 10:00:53
10 Coronary Care Unit (CCU) 2162-02-17 23:30:00
outtime los gender anchor_age anchor_year anchor_year_group
<dttm> <dbl> <chr> <int> <int> <chr>
1 2180-07-23 23:50:47 0.410 F 52 2180 2014 - 2016
2 2150-11-06 17:03:17 3.89 F 86 2150 2008 - 2010
3 2189-06-27 20:38:27 0.498 F 73 2186 2008 - 2010
4 2157-11-21 22:08:00 1.12 F 55 2157 2011 - 2013
5 2157-12-20 14:27:41 0.948 F 55 2157 2011 - 2013
6 2110-04-12 23:59:56 1.34 F 46 2110 2011 - 2013
7 2134-12-06 14:38:26 0.825 M 73 2131 2017 - 2019
8 2131-01-20 08:27:30 9.17 F 68 2122 2008 - 2010
9 2160-05-19 17:33:33 1.31 F 53 2156 2008 - 2010
10 2162-02-20 21:16:27 2.91 M 56 2162 2020 - 2022
dod admittime dischtime deathtime
<chr> <chr> <chr> <chr>
1 "2180-09-09" 2180-07-23 12:35:00 2180-07-25 17:55:00 ""
2 "2152-01-30" 2150-11-02 18:02:00 2150-11-12 13:45:00 ""
3 "2193-08-26" 2189-06-27 07:38:00 2189-07-03 03:00:00 ""
4 "" 2157-11-18 22:56:00 2157-11-25 18:00:00 ""
5 "" 2157-12-18 16:58:00 2157-12-24 14:55:00 ""
6 "" 2110-04-11 15:08:00 2110-04-14 15:00:00 ""
7 "2134-12-06" 2134-12-05 00:10:00 2134-12-06 12:54:00 "2134-12-06 12:54:00"
8 "2131-01-20" 2131-01-07 20:39:00 2131-01-20 05:15:00 "2131-01-20 05:15:00"
9 "" 2160-05-18 07:45:00 2160-05-23 13:30:00 ""
10 "2162-12-11" 2162-02-17 22:32:00 2162-03-04 15:16:00 ""
admission_type admit_provider_id admission_location
<chr> <chr> <chr>
1 EW EMER. P06OTX EMERGENCY ROOM
2 EW EMER. P26QQ4 EMERGENCY ROOM
3 EW EMER. P06OTX EMERGENCY ROOM
4 EW EMER. P3610N EMERGENCY ROOM
5 DIRECT EMER. P276OU PHYSICIAN REFERRAL
6 EW EMER. P32W56 PACU
7 URGENT P67ATB TRANSFER FROM HOSPITAL
8 OBSERVATION ADMIT P49AFC EMERGENCY ROOM
9 SURGICAL SAME DAY ADMISSION P8286C PHYSICIAN REFERRAL
10 OBSERVATION ADMIT P46834 PHYSICIAN REFERRAL
discharge_location insurance language marital_status race
<chr> <chr> <chr> <chr> <chr>
1 HOME Medicaid English "WIDOWED" WHITE
2 REHAB Medicare English "WIDOWED" WHITE
3 HOME HEALTH CARE Medicare English "MARRIED" BLACK/AFRICAN AMERICAN
4 HOME HEALTH CARE Private Other "MARRIED" WHITE
5 HOME HEALTH CARE Private Other "MARRIED" WHITE
6 HOME Private English "MARRIED" WHITE
7 DIED Medicare English "SINGLE" WHITE
8 DIED Medicare English "MARRIED" BLACK/AFRICAN AMERICAN
9 HOME HEALTH CARE Medicare English "SINGLE" OTHER
10 HOME HEALTH CARE Medicaid English "" UNKNOWN
edregtime edouttime hospital_expire_flag heart_rate
<chr> <chr> <int> <dbl>
1 "2180-07-23 05:54:00" "2180-07-23 14:00:00" 0 91
2 "2150-11-02 11:41:00" "2150-11-02 19:37:00" 0 78
3 "2189-06-27 06:25:00" "2189-06-27 08:42:00" 0 76
4 "2157-11-18 17:38:00" "2157-11-19 01:24:00" 0 86
5 "" "" 0 79.3
6 "" "" 0 86
7 "" "" 1 124.
8 "2131-01-07 13:36:00" "2131-01-07 22:13:00" 1 49
9 "" "" 0 80
10 "2162-02-17 19:35:00" "2162-02-17 23:30:00" 0 110.
non_invasive_bloodpressure_diastolic non_invasive_bloodpressue_systolic
<dbl> <dbl>
1 48 84
2 56.5 106
3 102 154
4 90 151
5 93.3 156
6 56 73
7 78 110
8 30.5 174.
9 62 98.5
10 80 112
respiratory_rate temperature_farhenheit bicarbonate chloride creatinine
<dbl> <dbl> <dbl> <dbl> <dbl>
1 24 98.7 25 95 0.7
2 24.3 97.7 26 100 1
3 23.5 98 21 109 2.3
4 18 98.5 22 108 0.6
5 14 97.6 30 104 0.5
6 19 97.7 NA 98 NA
7 16.5 97.9 28 97 1.3
8 13 98.1 30 88 1.1
9 14 97.2 24 102 0.9
10 21 97.9 18 NA 3.1
glucose potassium sodium hematocrit wbc age_intime
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 102 6.7 126 41.1 6.9 52
2 85 4.8 137 36.1 7.1 86
3 89 3.9 144 27.3 5.3 76
4 112 4.2 142 38.1 15.7 55
5 87 4.1 142 37.4 5.4 55
6 NA 4.1 139 NA NA 46
7 131 3.9 138 31.4 10.4 76
8 141 4.5 130 39.7 12.2 77
9 288 3.5 137 34.9 7.2 57
10 95 6.5 125 34.3 16.8 56
# ℹ 94,448 more rows
Q8. Exploratory data analysis (EDA)
Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:
Solution 8.1
Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)
# A tibble: 1,078 × 8
race insurance marital_status gender mean_los median_los sd_los IQR_los
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AMERICAN … "" "DIVORCED" M 2.44 2.44 NA 0
2 AMERICAN … "" "MARRIED" M 6.30 6.30 6.52 4.61
3 AMERICAN … "Medicai… "" F 16.9 16.9 NA 0
4 AMERICAN … "Medicai… "DIVORCED" F 13.4 13.4 17.8 12.6
5 AMERICAN … "Medicai… "DIVORCED" M 2.41 1.22 2.00 2.59
6 AMERICAN … "Medicai… "MARRIED" F 2.08 2.08 1.29 0.914
7 AMERICAN … "Medicai… "MARRIED" M 8.80 8.25 8.39 10.6
8 AMERICAN … "Medicai… "SINGLE" F 1.45 1.33 0.738 1.00
9 AMERICAN … "Medicai… "SINGLE" M 3.07 1.06 4.45 1.11
10 AMERICAN … "Medicar… "" F 3.12 4.11 1.74 1.51
# ℹ 1,068 more rows
# Plot 1: Boxplot of los vs raceggplot(mimic_icu_cohort, aes(x = race, y = los)) +geom_boxplot() +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="Length of ICU Stay (LOS) by Race", x ="Race", y ="LOS (days)")
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_boxplot()`).
# Plot 2: Boxplot of los vs insuranceggplot(mimic_icu_cohort, aes(x = insurance, y = los)) +geom_boxplot() +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="Length of ICU Stay (LOS) by Insurance", x ="Insurance", y ="LOS (days)")
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_boxplot()`).
# Plot 3: Boxplot of los vs marital statusggplot(mimic_icu_cohort, aes(x = marital_status, y = los)) +geom_boxplot() +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="Length of ICU Stay (LOS) by Marital Status", x ="Marital Status", y ="LOS (days)")
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_boxplot()`).
# Plot 4: Boxplot of los vs genderggplot(mimic_icu_cohort, aes(x = gender, y = los)) +geom_boxplot() +labs(title ="Length of ICU Stay (LOS) by Gender", x ="Gender", y ="LOS (days)")
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_boxplot()`).
# Plot 5: Scatter plot of los vs age at intime (age_intime)ggplot(mimic_icu_cohort, aes(x = age_intime, y = los)) +geom_point(color="lightpink",alpha =0.5) +labs(title ="Length of ICU Stay (LOS) by Age at Admission",x ="Age at Admission (years)", y ="LOS (days)")
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
Solution 8.2
Length of ICU stay los vs the last available lab measurements before ICU stay
# Load necessary librarieslibrary(ggplot2)library(dplyr)# Binning lab measurements for box plot representationmimic_icu_cohort <- mimic_icu_cohort %>%mutate(glucose_bin =cut(glucose, breaks =5),sodium_bin =cut(sodium, breaks =5),potassium_bin =cut(potassium, breaks =5),wbc_bin =cut(wbc, breaks =5),creatinine_bin =cut(creatinine, breaks =5))# Plot 1: Boxplot of LOS vs Glucose binsggplot(mimic_icu_cohort, aes(x = glucose_bin, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se =FALSE, color ="blue") +labs(title ="LOS vs Glucose", x ="Glucose Bins (mg/dL)",y ="Length of ICU Stay (days)") +theme(axis.text.x =element_text(angle =45, hjust =1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 2: Boxplot of LOS vs Sodium binsggplot(mimic_icu_cohort, aes(x = sodium_bin, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se =FALSE, color ="blue") +labs(title ="LOS vs Sodium", x ="Sodium Bins (mEq/L)",y ="Length of ICU Stay (days)") +theme(axis.text.x =element_text(angle =45, hjust =1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_smooth()`).
Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 3: Boxplot of LOS vs Potassium binsggplot(mimic_icu_cohort, aes(x = potassium_bin, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se =FALSE, color ="blue") +labs(title ="LOS vs Potassium", x ="Potassium Bins (mEq/L)",y ="Length of ICU Stay (days)") +theme(axis.text.x =element_text(angle =45, hjust =1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_smooth()`).
Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 4: Boxplot of LOS vs WBC binsggplot(mimic_icu_cohort, aes(x = wbc_bin, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se =FALSE, color ="blue") +labs(title ="LOS vs WBC", x ="WBC Bins (K/uL)", y ="Length of ICU Stay (days)") +theme(axis.text.x =element_text(angle =45, hjust =1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_smooth()`).
Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 5: Boxplot of LOS vs Creatinine binsggplot(mimic_icu_cohort, aes(x = creatinine_bin, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se =FALSE, color ="blue") +labs(title ="LOS vs Creatinine", x ="Creatinine Bins (mg/dL)",y ="Length of ICU Stay (days)") +theme(axis.text.x =element_text(angle =45, hjust =1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_smooth()`).
Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
Solution 8.3
Length of ICU stay los vs the first vital measurements within the ICU stay
# Load necessary librarieslibrary(ggplot2)# Plot 1: Scatter plot of LOS vs Heart Rateggplot(mimic_icu_cohort, aes(x = heart_rate, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se =FALSE, color ="blue") +labs(title ="LOS vs Heart Rate", x ="Heart Rate (bpm)",y ="Length of ICU Stay (days)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 100 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 100 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 2: Scatter plot of LOS vs Systolic Blood Pressureggplot(mimic_icu_cohort, aes(x = non_invasive_bloodpressue_systolic, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se=FALSE, color ="blue") +labs(title ="LOS vs Systolic Blood Pressure", x ="Systolic BP (mmHg)",y ="Length of ICU Stay (days)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1384 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1384 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 3: Scatter plot of LOS vs Diastolic Blood Pressureggplot(mimic_icu_cohort, aes(x = non_invasive_bloodpressure_diastolic, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se=FALSE, color ="blue") +labs(title ="LOS vs Diastolic Blood Pressure",x ="Diastolic BP (mmHg)",y ="Length of ICU Stay (days)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1389 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1389 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 4: Scatter plot of LOS vs Body Temperatureggplot(mimic_icu_cohort, aes(x = temperature_farhenheit, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se=FALSE, color ="blue") +labs(title ="LOS vs Body Temperature", x ="Body Temperature (°C)",y ="Length of ICU Stay (days)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1689 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1689 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot 5: Scatter plot of LOS vs Respiratory Rateggplot(mimic_icu_cohort, aes(x = respiratory_rate, y = los)) +geom_point(color="lightblue",alpha =0.6) +geom_smooth(method ="lm",se=FALSE, color ="blue") +labs(title ="LOS vs Respiratory Rate", x ="Respiratory Rate (breaths/min)",y ="Length of ICU Stay (days)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 212 rows containing missing values or values outside the scale range
(`geom_point()`).
Solution 8.4
Length of ICU stay los vs first ICU unit
ggplot(mimic_icu_cohort, aes(x = los, y = first_careunit)) +geom_boxplot(color="black",fill="lightpink",alpha =0.6) +labs(title ="LOS vs first ICU unit", x ="Respiratory Rate (breaths/min)", y ="Length of ICU Stay (days)") +theme_minimal()
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_boxplot()`).